SDS 323
James Scott
Reference: Data Science: A Gentle Introduction Chapter 5; Statistical Learning Ch 5.2
From the New England Journal of Medicine in 2006:
We randomly assigned patients with resectable adenocarcinoma of the stomach, esophagogastric junction, or lower esophagus to either perioperative chemotherapy and surgery (250 patients) or surgery alone (253 patients)…. With a median follow-up of four years, 149 patients in the perioperative-chemotherapy group and 170 in the surgery group had died. As compared with the surgery group, the perioperative-chemotherapy group had a higher likelihood of overall survival (five-year survival rate, 36 percent vs. 23 percent).
Conclusion:
Conclusion:
Not so fast! In statistics, we ask “what if?” a lot:
Conclusion:
Always remember two basic facts about samples:
Conclusion:
By “quantifying uncertainty” or “statistical inference,” we mean filling in the blanks.
In stats, we equate trustworthiness with stability:
\[ \begin{array}{r} \mbox{Confidence in} \\ \mbox{your estimates} \\ \end{array} \iff \begin{array}{l} \mbox{Stability of those estimates} \\ \mbox{under the influence of chance} \\ \end{array} \]
To assess stability, you have to contemplate the possibility of alternate data universes.
For example:
Let's work through a thought experiment…
Imagine Andrey Kolmogorov on a four-day fishing trip.
At right we see the sampling distribution for both \( \beta_0 \) and \( \beta_1 \).
Suppose we are trying to estimate some population-level quantity \( \theta \): the parameter of interest.
So we take a sample from the population: \( X_1, X_2, \ldots, X_N \).
We use the data to form an estimate \( \hat{\theta}_N \) of the parameter. Key insight: \( \hat{\theta}_N \) is a random variable.
Now imagine repeating this process thousands of times! Since \( \hat{\theta}_N \) is a random variable, it has a probability distribution.
Estimator: any method for estimating the value of a parameter (e.g. sample mean, sample proportion, slope of OLS line, etc).
Sampling distribution: the probability distribution of an estimator \( \hat{\theta}_N \) under repeated samples of size \( N \).
Bias: Let \( \bar{\theta}_N = E(\hat{\theta}_N) \) be the mean of the sampling distribution. The bias of \( \hat{\theta}_N \) is \( (\bar{\theta}_N - \theta) \): the difference between the average answer and the truth.
Unbiased estimator: \( (\bar{\theta}_N - \theta) = 0 \).
Standard error: the standard deviation of an estimator's sampling distribution:
\[ \begin{aligned} \mbox{se}(\hat{\theta}_N) &= \sqrt{ \mbox{var}(\hat{\theta}_N) } \\ &= \sqrt{ E[ (\hat{\theta}_N - \bar{\theta}_N )^2] } \\ &= \mbox{Typical deviation of $\hat{\theta}_N$ from its average} \end{aligned} \]
“If I were to take repeated samples from the population and use this estimator for every sample, how much does the answer vary, on average?”
If an estimator is unbiased, then \( \bar{\theta}_N = \theta \), so
\[ \begin{aligned} \mbox{se}(\hat{\theta}_N) &= \sqrt{ E[ (\hat{\theta}_N - \bar{\theta}_N )^2] } \\ &= \sqrt{ E[ (\hat{\theta}_N - \theta )^2] } \\ &= \mbox{Typical deviation of $\hat{\theta}_N$ from the truth} \end{aligned} \]
“If I were to take repeated samples from the population and use this estimator for every sample, how big of an error do I make, on average?”
Don't make any lifestyle choices that require greater precision!
Don't reach any scientific conclusions that require greater precision!
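When we know the population, this repeated-sampling story is easy to simulate directly. Here is a minimal sketch in R; the exponential population, the true mean, and the sample size are all made-up illustrative choices:

```r
# Simulate the sampling distribution of the sample mean by drawing
# thousands of samples of size N from a known (made-up) population.
set.seed(1)
N = 25
theta = 10  # true population mean (known here, unlike in real life)

# each draw: take a fresh sample of size N and compute its mean
sample_means = replicate(5000, mean(rexp(N, rate = 1/theta)))

mean(sample_means)  # close to theta: the sample mean is unbiased
sd(sample_means)    # the standard error: close to theta/sqrt(N) = 2
```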
But there's a problem here…
Two roads diverged in a yellow wood
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth…
–Robert Frost, The Road Not Taken, 1916
Quantifying our uncertainty would seem to require knowing all the roads not taken—an impossible task.
Problem: we can't take repeated samples of size \( N \) from the population, to see how our estimate changes across samples.
Seemingly hacky solution: take repeated samples of size \( N \), with replacement, from the sample itself, and see how our estimate changes across samples. This is something we can easily simulate on a computer.
Basically, we pretend that our sample is the whole population and we charge ahead! This is called bootstrap resampling, or just bootstrapping.
(Figure: bootstrapped samples 1, 2, and 3, each drawn with replacement from the original sample.)
Start with your original sample \( S = \{X_1, \ldots, X_N\} \) and original estimate \( \hat{\theta}_N \).
For \( b=1,\ldots,B \): draw a bootstrapped sample \( S^{(b)} \) of size \( N \), with replacement, from \( S \), and compute the estimate \( \hat{\theta}_N^{(b)} \) from this sample.
Result: a set of \( B \) different estimates \( \hat{\theta}_N^{(1)}, \hat{\theta}_N^{(2)}, \ldots, \hat{\theta}_N^{(B)} \) that approximate the sampling distribution of \( \hat{\theta}_N \).
Calculate the bootstrapped standard error as the standard deviation of the bootstrapped estimates:
\[ \hat{se}(\hat{\theta}_N) = \mbox{std dev}\left( \hat{\theta}_N^{(1)}, \hat{\theta}_N^{(2)}, \ldots, \hat{\theta}_N^{(B)} \right) \]
This isn't the true standard error, but it's often a good approximation!
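As a concrete sketch of the algorithm above, take the sample mean as the estimator; the data here are made up:

```r
# Bootstrap the standard error of the sample mean (illustrative data).
set.seed(1)
x = rnorm(100, mean = 50, sd = 10)  # pretend this is our one real sample
theta_hat = mean(x)                 # original estimate

B = 2000
boot_est = numeric(B)
for (b in 1:B) {
  x_boot = sample(x, replace = TRUE)  # resample the sample, with replacement
  boot_est[b] = mean(x_boot)          # re-estimate on the bootstrapped sample
}

se_boot = sd(boot_est)  # bootstrapped standard error
se_boot                 # close to the plug-in value sd(x)/sqrt(100)
```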
Let's dig into some R code and data: creatinine_bootstrap.R and creatinine.csv (both on class website).
We'll bootstrap two estimators:
Informally, an interval estimate is a range of plausible values for the parameter of interest. For example:
\[ \theta \in \hat{\theta}_N \pm k \cdot \hat{se} (\hat{\theta}_N) \]
\[ \theta \in (q_{2.5}, q_{97.5}) \]
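Both forms are easy to compute from a set of bootstrapped estimates. A self-contained sketch, with made-up data, the sample mean as the estimator, and \( k = 2 \) as an illustrative choice:

```r
set.seed(1)
x = rnorm(200, mean = 5)  # made-up sample; the true mean is 5
theta_hat = mean(x)
boot_est = replicate(1000, mean(sample(x, replace = TRUE)))
se_hat = sd(boot_est)

# Form 1: point estimate plus or minus k standard errors (k = 2 here)
c(theta_hat - 2*se_hat, theta_hat + 2*se_hat)

# Form 2: central 95% interval of the bootstrapped estimates
quantile(boot_est, probs = c(0.025, 0.975))
```

For a roughly symmetric sampling distribution the two intervals nearly coincide; they can differ when the bootstrapped estimates are skewed.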
We'd like to be able to associate a confidence level with an interval estimate like this. How?
If an interval estimate satisfies the frequentist coverage principle, we call it a confidence interval:
Frequentist coverage principle: If you were to analyze one data set after another for the rest of your life, and you were to quote X% confidence intervals for every estimate you made, those intervals should cover their corresponding true values at least X% of the time. Here X can be any number between 0 and 100.
An interval estimate takes the form \( \hat{I}_N = [\hat{L}_N, \hat{U}_N] \). Just like a point estimate \( \hat{\theta}_N \), the interval estimate is a random variable, because its endpoints are functions of a random sample.
We say that \( [\hat{L}_N, \hat{U}_N] \) is a confidence interval at coverage level \( 1-\alpha \) if, for every \( \theta \),
\[ P_{\theta} \left( \theta \in [\hat{L}_N, \hat{U}_N] \right) \geq 1- \alpha \, , \]
where \( P_{\theta} \) is the probability distribution of the data, assuming that the true parameter is equal to \( \theta \).
The key statement here can be one of the most confusing in all of statistics:
\[ P_{\theta} \left( \theta \in [\hat{L}_N, \hat{U}_N] \right) \geq 1- \alpha \, , \]
Three questions to ask yourself:
So recall our two methods of generating an interval estimate using the bootstrap:
The obvious question is: do these interval estimates satisfy the frequentist coverage principle?
The answer is: not always, but often!
In lots of common situations, both forms of bootstrapped interval estimate approximately satisfy the coverage requirement:
\[ P_{\theta} \left( \theta \in [\hat{L}_N, \hat{U}_N] \right) \approx 1- \alpha \, , \]
And the approximation gets better with larger sample sizes. That is, as \( N \) gets large,
\[ P_{\theta} \left( \theta \in [\hat{L}_N, \hat{U}_N] \right) \to 1-\alpha \]
The math here is super hairy; we won't go into it. (Google “empirical process theory” if you want to learn more and you've got a year or two to spare…)
But we can run a sanity check through Monte Carlo simulation!
Let's revisit our thought experiment about fishing…
Let's go on 100 fishing trips. On each trip:
Because we know the slope of the true line (\( \beta_1 = 4.25 \)), we can check whether each bootstrapped confidence interval contains the true value. About 80% of them should!
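Here is a stripped-down version of that sanity check. The data-generating process below (a uniform \( x \), \( n = 30 \), noise sd of 2) is a made-up stand-in for the class's fishing data, not the actual simulation from the course code:

```r
# Monte Carlo check of frequentist coverage for an 80% bootstrap
# percentile interval on a regression slope.
set.seed(1)
beta1 = 4.25  # the true slope
covered = 0
for (trip in 1:100) {                    # 100 "fishing trips"
  x = runif(30, 0, 5)
  y = beta1 * x + rnorm(30, sd = 2)
  boot_slopes = replicate(200, {         # bootstrap within each trip
    idx = sample(30, replace = TRUE)
    coef(lm(y[idx] ~ x[idx]))[2]
  })
  ci = quantile(boot_slopes, c(0.10, 0.90))  # 80% percentile interval
  if (ci[1] <= beta1 && beta1 <= ci[2]) covered = covered + 1
}
covered  # should be roughly 80 of the 100 intervals
```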
Sometimes we can use probability theory to calculate a “plug-in” estimate of an estimator's standard error. Some simple cases include:
Let's see an example and compare the result with a bootstrap estimate of the standard error.
Suppose that \( X_1, X_2, \ldots, X_N \) are a sample of independent, identically distributed (IID) random variables with unknown mean \( \mu \) and variance \( \sigma^2 \). Let \( \bar{X}_N \) be the sample mean:
\[ \bar{X}_N = \frac{1}{N} \sum_{i=1}^N X_i \]
Clearly \( \bar{X}_N \) is a sensible estimate of \( \mu \), since it is unbiased: \( E(\bar{X}_N) = \mu \) (show this!)
We can also calculate the theoretical variance of \( \bar{X}_N \) as:
\[ \begin{aligned} \mbox{var}(\bar{X}_N) = \mbox{var} \left( \frac{1}{N} \sum_{i=1}^N X_i \right) & = \frac{1}{N^2} \mbox{var} \left( \sum_{i=1}^N X_i \right) \\ &= \frac{1}{N^2} N \sigma^2 \\ &= \frac{\sigma^2}{N} \end{aligned} \]
This tells us that the true standard error of the sample mean is:
\[ \mbox{se}(\bar{X}_N) = \frac{\sigma}{\sqrt{N}} \]
Or in words:
\[ \small \mbox{Average error of the sample mean} = \frac {\mbox{Average error of a single measurement}} { \mbox{Square root of sample size} } \]
This is sometimes called de Moivre's equation, after Abraham de Moivre.
There's only one problem with de Moivre's equation: we don't know the true \( \sigma \)!
\[ \mbox{se}(\bar{X}_N) = \frac{\sigma}{\sqrt{N}} \]
The obvious solution is to estimate \( \sigma \) from the data. This results in the so-called “plug-in” estimate of the standard error:
\[ \hat{se}(\bar{X}_N) = \frac{\hat{\sigma}}{\sqrt{N}} \]
where \( \hat{\sigma} \) is an estimate of the population standard deviation (e.g. the sample standard deviation).
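A quick sketch comparing the plug-in and bootstrap answers on made-up data (the true \( \sigma = 2 \) is an illustrative choice):

```r
# Plug-in vs. bootstrapped standard error of the sample mean.
set.seed(1)
x = rnorm(500, mean = 0, sd = 2)   # made-up data; true se is 2/sqrt(500)
se_plugin = sd(x) / sqrt(length(x))
se_boot = sd(replicate(2000, mean(sample(x, replace = TRUE))))

c(se_plugin, se_boot)  # both close to 2/sqrt(500), about 0.089
```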
Suppose we have an estimator \( \hat{\theta}_N \) and we want to know its standard error. The general plug-in procedure involves three steps: (1) use probability theory to express the standard error as a formula involving unknown population quantities; (2) estimate those unknown quantities from the data; (3) plug the estimates into the formula.
Let's see an example in predimed_plugin.R.
Let's turn back to supervised learning!
So we believe that \( y_i = f(x_i) + \epsilon_i \), we've collected some data, and we have our estimate \( \hat{f}(x) \)… How can we quantify our uncertainty for this estimate?
Fundamental frequentist thought experiment: “How might \( \hat{f}(x) \) have been different in an alternate data universe?”
But what does “alternate data universe” actually mean?
There's a version of the bootstrap for all three situations:
Let's see all three versions of the bootstrap one by one.
If you've seen the bootstrap before, it was probably this one!
This leads to the following algorithm.
For b = 1 to B: resample the \( (x_i, y_i) \) pairs with replacement, refit the model to the resampled data to get \( \hat{f}^{(b)} \), and evaluate \( \hat{f}^{(b)}(x) \).
This gives us \( B \) draws from the bootstrapped sampling distribution of \( \hat{f}(x) \).
Use these draws to form (approximate) confidence intervals and standard errors for \( f(x) \).
library(tidyverse)
theme_set(theme_bw(base_size=18))
loadhou = read.csv('../data/loadhou.csv')
ggplot(loadhou) + geom_point(aes(x=KHOU, y=COAST), alpha=0.1)
Suppose we want to know \( f(5) \) and \( f(25) \), i.e. the expected values of COAST when KHOU = 5 and KHOU = 25, respectively. Let's bootstrap a KNN model, with \( K=40 \):
library(mosaic)
library(FNN)
X_test = data.frame(KHOU=c(5,25))
boot20 = do(500)*{
loadhou_boot = resample(loadhou) # construct a bootstrap sample
X_boot = select(loadhou_boot, KHOU)
y_boot = select(loadhou_boot, COAST)
knn20_boot = knn.reg(X_boot, X_test, y_boot, k=40)
knn20_boot$pred
}
head(boot20, 3) # first column is f(5), second is f(25)
V1 V2
1 10708.95 11882.67
2 10802.91 11997.82
3 10513.39 11812.59
Now we can calculate standard errors and/or confidence intervals.
se_hat = apply(boot20, 2, sd)
se_hat
V1 V2
172.5259 156.6715
apply(boot20, 2, quantile, probs=c(0.025, 0.975))
V1 V2
2.5% 10283.88 11532.70
97.5% 10974.92 12125.45
confint(boot20)
name lower upper level method estimate
1 V1 10283.88 10974.92 0.95 percentile 10638.78
2 V2 11532.70 12125.45 0.95 percentile 11906.23
X_test = data.frame(KHOU=seq(0, 35, by=0.1))
plot(COAST ~ KHOU, data=loadhou)
for(i in 1:500) {
loadhou_boot = resample(loadhou) # construct a bootstrap sample
X_boot = select(loadhou_boot, KHOU)
y_boot = select(loadhou_boot, COAST)
knn20_boot = knn.reg(X_boot, X_test, y_boot, k=40)
lines(X_test$KHOU, knn20_boot$pred, col=rgb(1, 0, 0, 0.1))
}
This leads to the following algorithm. First fit the model, yielding \[ y_i = \hat{f}(x_i) + e_i \]
Then, for b = 1 to B: resample the residuals with replacement to get \( e^{(b)}_1, \ldots, e^{(b)}_N \); form synthetic outcomes \( y^{(b)}_i = \hat{f}(x_i) + e^{(b)}_i \); and refit the model to the \( (x_i, y^{(b)}_i) \) pairs.
theme_set(theme_bw(base_size=18))
ethanol = read.csv('ethanol.csv')
ggplot(ethanol) + geom_point(aes(x=E, y=NOx))
Key fact: the \( (x_i, y_i) \) are not random samples here! The \( x_i \) points are fixed as part of the experimental design.
Let's quantify uncertainty for \( f(0.7) \) and \( f(0.95) \), under a 5th-order polynomial model, via the residual resampling bootstrap.
First, let's look at the empirical distribution of the residuals:
poly5 = lm(NOx ~ poly(E, 5), data=ethanol)
yhat = fitted(poly5)
evector = ethanol$NOx - yhat # empirical distribution of residuals
hist(evector, 20)
This is our estimate \( \hat{P}(e) \) for the probability distribution of the residuals.
Now we bootstrap:
X_test = data.frame(E=c(0.7, 0.95))
boot5 = do(500)*{
e_boot = resample(evector)
y_boot = yhat + e_boot # construct synthetic outcomes
ethanol_boot = ethanol
ethanol_boot$NOx = y_boot # substitute real outcomes with synthetic ones
poly5_boot = lm(NOx ~ poly(E, 5), data=ethanol_boot)
fhat_boot = predict(poly5_boot, X_test)
fhat_boot
}
head(boot5, 3) # first column is f(0.7), second is f(0.95)
X1 X2
1 1.479639 3.634581
2 1.576081 3.536532
3 1.506203 3.615962
As before, we can get standard errors and/or confidence intervals.
se_hat = apply(boot5, 2, sd)
se_hat
X1 X2
0.07574009 0.07527173
apply(boot5, 2, quantile, probs=c(0.025, 0.975))
X1 X2
2.5% 1.359714 3.390013
97.5% 1.653690 3.688565
Solution: simulate hypothetical “alternate data universes” from a parametric model fitted to your data.
Example: see predimed_bootstrap.R
Suppose that you're trying to construct a portfolio: that is, to decide how to allocate your wealth among \( D \) financial assets. Things you might want to track include:
Key idea: use the bootstrap to simulate portfolio performance. (See portfolio.R)
Notation:
Let \( T \) be our investing horizon (e.g. T = 20 days, T = 40 years, etc), and let \( t \) index discrete time steps along the way.
Let \( X_{t,j} \) be the value of asset \( j = 1, \ldots, D \) at time period \( t \).
Let \( R_{t,j} \) be the return of asset \( j \) in period \( t \), so that we have the following recursive update:
\[ X_{t,j} = X_{t-1, j} \cdot (1 + R_{t,j}) \]
Notation:
A portfolio is a set of investment weights over assets: \( (w_{t1}, w_{t2}, \ldots, w_{tD}) \). Note: these weights might be fixed, or they might change over time.
The value of your portfolio is the weighted sum of the value of your assets:
\[ W_{t} = \sum_{j=1}^D w_{t,j} X_{t,j} \]
We care about \( W_T \): the random variable describing your terminal wealth after \( T \) investment periods.
Problem: this random variable is a super-complicated, nonlinear function of \( T \times D \) individual asset returns:
\[ W_T = f(R) \quad \mbox{where} \quad R = \{R_{t,j} : t = 1, \ldots, T; j = 1, \ldots, D\} \]
If we knew the asset returns, we could evaluate this function recursively, starting with initial wealth \( W_0 \) at time \( t=0 \) and sweeping through to time \( t=T \):
Starting with initial wealth \( X^{(i)}_{1, j} \) in each asset, we sweep through from \( t=1 \) to \( t=T \):
\[ \begin{aligned} X^{(f)}_{t, j} &= X^{(i)}_{t, j} \cdot (1 + R_{t,j}) &\quad \mbox{(Update each asset)} \\ W_{t} &= \sum_{j=1}^D w_{t,j} X^{(f)}_{t,j} &\quad \mbox{(Sum over assets)} \\ X^{(i)}_{t+1,j} &= w_{t+1,j} \cdot W_{t} &\quad \mbox{(Rebalance)} \\ \end{aligned} \]
But of course, we don't know the asset returns! This suggests that we should use a Monte Carlo simulation, where we repeat the following for loop many times.
For \( t = 1, \ldots, T \): (1) simulate the vector of asset returns \( (R_{t,1}, \ldots, R_{t,D}) \) from its joint probability distribution; (2) update each asset's value; (3) sum over assets and rebalance.
The precise math of the update and rebalance steps is on the previous slide.
The difficult step here is (1): simulate a high-dimensional vector of asset returns from its joint probability distribution.
In general, using simple parametric probability models (e.g. multivariate Gaussian) to describe high-dimensional joint distributions is a very dicey proposition.
Suppose we have \( M \) past samples of the asset returns, stacked in a matrix:
\[ R = \left( \begin{array}{r r r r} R_{11} & R_{12} & \cdots & R_{1D} \\ R_{21} & R_{22} & \cdots & R_{2D} \\ \vdots & && \\ R_{M1} & R_{M2} & \cdots & R_{MD} \end{array} \right) \]
where \( R_{tj} \) is the return of asset \( j \) in period \( t \).
The key idea of bootstrap resampling is the following: treat the past returns as an empirical approximation to the joint distribution of asset returns, and simulate future periods by sampling whole rows of \( R \) with replacement.
Thus our Monte Carlo simulation looks like the following at each draw.
For \( t = 1, \ldots, T \): (1) draw an entire row of \( R \) uniformly at random, with replacement, to serve as that period's vector of returns; (2) update and rebalance as before.
Let's go to the R code! See portfolio.R on the website.
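portfolio.R has the full version; here is a stripped-down, self-contained sketch of one Monte Carlo draw. The return history R_hist, the weights, the horizon, and the initial wealth are all made-up illustrative values:

```r
# One simulated draw of terminal wealth, resampling whole rows of a
# (made-up) matrix of past daily returns.
set.seed(1)
M = 250; D = 3
R_hist = matrix(rnorm(M * D, mean = 0.0004, sd = 0.01), nrow = M)  # past returns
w = c(0.5, 0.3, 0.2)   # fixed portfolio weights, rebalanced every period
T_horizon = 20         # investing horizon: 20 trading days
wealth = 10000         # initial wealth W_0

for (t in 1:T_horizon) {
  R_t = R_hist[sample(M, 1), ]     # draw an entire past row as this period's returns
  holdings = w * wealth            # rebalance: split wealth by the weights
  holdings = holdings * (1 + R_t)  # update each asset
  wealth = sum(holdings)           # sum over assets
}
wealth  # terminal wealth W_T for this one Monte Carlo draw
```

Repeating this whole loop many times gives a set of draws of \( W_T \), whose histogram approximates the distribution of terminal wealth.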
Why do we draw an entire row of \( R \) at a time?